Compressed suffix tree - a basis for genome-scale sequence analysis
نویسندگان
چکیده
UNLABELLED Suffix tree is one of the most fundamental data structures in string algorithms and biological sequence analysis. Unfortunately, when it comes to implementing those algorithms and applying them to real genomic sequences, often the main memory size becomes the bottleneck. This is easily explained by the fact that while a DNA sequence of length n from alphabet sigma = {A, C, G, T} can be stored in n log absolute value(sigma) = 2n bits, its suffix tree occupies O(n log n) bits. In practice, the size difference easily reaches factor 50. We provide an implementation of the compressed suffix tree very recently proposed by Sadakane (Theory of Computing Systems, in press). The compressed suffix tree occupies space proportional to the text size, i.e. O(n log) absolute value(sigma)) bits, and supports all typical suffix tree operations with at most log n factor slowdown. Our experiments show that, e.g. on a 10 MB DNA sequence, the compressed suffix tree takes 10% of the space of normal suffix tree. Typical operations are slowed down by factor 60. AVAILABILITY The C++ implementation under GNU license is available at http://www.cs.helsinki.fi/group/suds/cst/. An example program implementing a typical pattern discovery task is included. Experimental results in this note correspond to version 0.95.
منابع مشابه
Computing Matching Statistics and Maximal Exact Matches on Compressed Full-Text Indexes
Exact string matching is a problem that computer programmers face on a regular basis, and full-text indexes like the suffix tree or the suffix array provide fast string search over large texts. In the last decade, research on compressed indexes has flourished because the main problem in large-scale applications is the space consumption of the index. Nowadays, the most successful compressed inde...
متن کاملSpace-Economical Algorithms for Finding Maximal Unique Matches
We show space-economical algorithms for finding maximal unique matches (MUM’s) between two strings which are important in large scale genome sequence alignment problems. Our algorithms require only O(n) bits (O(n/ log n) words) where n is the total length of the strings. We propose three algorithms for different inputs: In case the input is only the strings, their compressed suffix array, or th...
متن کاملCompressed Suffix Trees for Repetitive Texts
We design a new compressed suffix tree specifically tailored to highly repetitive text collections. This is particularly useful for sequence analysis on large collections of genomes of the close species. We build on an existing compressed suffix tree that applies statistical compression, and modify it so that it works on the grammar-compressed version of the longest common prefix array, whose d...
متن کاملStorage and Retrieval of Highly Repetitive Sequence Collections
A repetitive sequence collection is a set of sequences which are small variations of each other. A prominent example are genome sequences of individuals of the same or close species, where the differences can be expressed by short lists of basic edit operations. Flexible and efficient data analysis on such a typically huge collection is plausible using suffix trees. However, the suffix tree occ...
متن کاملString Searching in Referentially Compressed Genomes
Background:Improved sequencing techniques have led to large amounts of biological sequence data. One of the challenges in managing sequence data is efficient storage. Recently, referential compression schemes, storing only the differences between a to-be-compressed input and a known reference sequence, gained a lot of interest in this field. However, so far sequences always have to be decompres...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- Bioinformatics
دوره 23 5 شماره
صفحات -
تاریخ انتشار 2007